Improved Indexing & Searching Throughput

Author

  • Matt Crane
Abstract

Information retrieval is the process of finding relevant information in large corpora of documents based on user queries. Within the discipline there are a number of open research questions and areas. This thesis presents a systematic study into improving the speed of all aspects of an information retrieval system, without such improvements having an adverse effect on the effectiveness of that system. Several key areas of the indexing process were investigated: the effect of removing spam and correcting encoding errors at indexing time; the amount of parallelism and further improvements to the indexing process; and the methods of vocabulary accumulation and collision resolution within a hash table. As part of the indexing process, a new family of hash functions for information retrieval, which exploit the properties of natural language, was also proposed. Search performance was investigated by examining the effects of spam removal on search quality. A relationship between the size of a collection and the pre-calculation of retrieval scores was discovered. Overall results indicate a 30% improvement in indexing throughput. This is accompanied by a 15% increase in search quality, while search speed could be increased by 25% without degrading quality. The pre-calculation of retrieval scores further improves retrieval speed by up to 3×. These results were compared against other open-source indexing systems by ATIRE when participating in the SIGIR 2015 RIGOR Workshop Reproducibility Challenge. The results of this challenge show that ATIRE is the fastest indexing system (taking half the time of the next best system), and the second fastest search system using the discovered relationship.

Supervisors: Dr. Andrew Trotman & Dr. Richard O’Keefe

Available: https://ourarchive.otago.ac.nz/handle/10523/6223

ACM SIGIR Forum, Vol. 50 No. 1, June 2016, p. 87
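The abstract does not describe how retrieval scores are pre-calculated, so the following is only a minimal sketch of the general idea, not the method used in the thesis or in ATIRE. It assumes a BM25-style ranking function, whitespace tokenization, and hypothetical names (`build_index`, `search`, and the parameters `k1` and `b`). The per-term, per-document score is computed once at indexing time and stored in the postings, so that answering a query reduces to summing stored values.

```python
import math
from collections import Counter, defaultdict


def build_index(docs, k1=0.9, b=0.4):
    """Build postings that hold a pre-computed BM25 score per (term, doc).

    docs maps a document id to its text. k1 and b are assumed BM25
    parameters chosen for illustration only.
    """
    tokenized = {doc_id: text.lower().split() for doc_id, text in docs.items()}
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized.values()) / n_docs

    # Document frequency: in how many documents each term occurs.
    df = Counter()
    for tokens in tokenized.values():
        df.update(set(tokens))

    index = defaultdict(dict)
    for doc_id, tokens in tokenized.items():
        # Length normalization depends only on the document.
        norm = k1 * (1 - b + b * len(tokens) / avg_len)
        for term, tf in Counter(tokens).items():
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # The full score is stored, so no arithmetic beyond addition
            # is needed when the term is later used in a query.
            index[term][doc_id] = idf * tf * (k1 + 1) / (tf + norm)
    return index


def search(index, query):
    """Rank documents by summing the stored scores of the query terms."""
    scores = defaultdict(float)
    for term in query.lower().split():
        for doc_id, score in index.get(term, {}).items():
            scores[doc_id] += score
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    example_docs = {
        "d1": "fast indexing of large collections",
        "d2": "fast search",
        "d3": "slow search over large collections",
    }
    example_index = build_index(example_docs)
    for doc_id, score in search(example_index, "fast search"):
        print(doc_id, round(score, 3))
```

The trade-off in this sketch is that scoring work moves from query time to indexing time; any change to the ranking parameters would require the stored scores to be rebuilt.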


Similar resources

IndexToolkit: an open source toolbox to index protein databases for high-throughput proteomics

A software package, IndexToolkit, aimed at overcoming the disadvantage of FASTA-format databases for frequent searching, is developed to utilize an indexing strategy to substantially accelerate sequence queries. IndexToolkit includes user-friendly tools and an Application Programming Interface (API) to facilitate indexing, storage and retrieval of protein sequence databases. As open ...


Context based Indexing in Information Retrieval System using BST

Searching of data relevant to our query is done by information retrieval system. Keyword searching is the basic idea of this system which tries to solve the large search space problem as the documents to be searched could be of any length. This means time to search will increase with length of document. Search time will be reduced by reducing the search space. In this, we are constructing a met...


Efficient Concurrent Operations in Spatial Databases

As demanded by applications such as GIS, CAD, ecology analysis, and space research, efficient spatial data access methods have attracted much research. Especially, moving object management and continuous spatial queries are becoming highlighted in the spatial database area. However, most of the existing spatial query processing approaches were designed for single-user environments, which may no...


Searching Web Data: an Entity Retrieval Model

More and more (semi) structured information is becoming available on the Web in the form of documents embedding metadata (e.g., RDF, RDFa, Microformats and others). There are already hundreds of millions of such documents accessible and their number is growing rapidly. This calls for large scale systems providing effective means of searching and retrieving this semi-structured information with ...


High Throughput Modularized NLP System for Clinical Text

This paper presents the results of the development of a high throughput, real time modularized text analysis and information retrieval system that identifies clinically relevant entities in clinical notes, maps the entities to several standardized nomenclatures and makes them available for subsequent information retrieval and data mining. The performance of the system was validated on a small c...



Journal title:
  • SIGIR Forum

Volume 50, Issue 1

Publication date: 2016